Open Data Notebooks 2017-01 :: ODN2017-01

Case Study: Traffic Accidents in Belgrade 2015.

Data Set: OPENDATA_SNEZGODE_exc.csv

Source: data.gov.rs

Description: Traffic Accidents in Belgrade, 2015.


Author: Goran S. Milovanovic, Data Science Serbia

Notebook: 01/30/2017, Belgrade, Serbia


This notebook accompanied the first Open Data R Meetup in Belgrade, organized by Data Science Serbia at Startit Center, Savska 5, Belgrade, on 01/30/2017. The event was held to introduce the Data Science community in Serbia to the people behind the Open Data initiative, and to celebrate the completion of the 2nd Introduction to R for Data Science course that Data Science Serbia has run at Startit, Belgrade. As of now, more than 45 people have learned - or are currently learning - to code in R for free, thanks to our efforts to develop the Introduction to R for Data Science course.

The notebook focuses on an exploratory analysis of the test open data set of Traffic Accidents in Belgrade in 2015, provided at the Open Data Portal of the Republic of Serbia that is currently under development. The data set was kindly provided to the Open Data Portal by the Republic of Serbia Ministry of Interior. Many more open data sets will be indexed and uploaded in the forthcoming weeks and months.

Besides exploring and visualizing this test data set, we demonstrate the basic usage of the {weatherData} package to fetch historical weather data into R, {wbstats} to access the rich World Bank time series, and {ISOcodes}.

Some exploratory modeling (Negative Binomial Regression with glm.nb() and Ordinal Logistic Regression with clm() from {ordinal}) is exercised merely to assess the prima facie effects of the most influential factors.


Disclaimer. The Open Data Portal of the Republic of Serbia is a young initiative that is currently under development. Neither the owner of this GitHub account as an individual, nor Data Science Serbia as an organization, holds any responsibility for changes in the URLs of the data sets, or changes in the content of the data sets published on the Open Data Portal of the Republic of Serbia. The results of the exploratory analyses and statistical models presented on this GitHub account were developed for illustrative purposes only, exclusively with the goal of popularizing Open Data in mind. The owner of this GitHub account strongly advises consulting him (e-mail: goran.s.milovanovic@gmail.com) and Data Science Serbia before using the results presented here in public debate or media, and/or for any purposes other than motivating the usage and development of Open Data.


1. Setup

Load libraries + raw data:

rm(list = ls())
library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)
library(ggmap)

### --- Working Directory
wDir <- '../TrafficAccidentsBGD2015'
setwd(wDir)

### --- Load Raw Data Set
fileLoc <- 'OPENDATA_SNEZGODE exc.csv'
rawData <- read.csv(fileLoc,
                    header = T,
                    check.names = F,
                    stringsAsFactors = F) 

### --- Inspect Data Set
dim(rawData)
## [1] 12873    11

Take a sneak peek at the data set:

glimpse(rawData)
## Observations: 12,873
## Variables: 11
## $ NEZG_ID   <int> 1049390, 1049255, 1041851, 1041992, 1042207, 1042241...
## $ VREME_NEZ <chr> "30.01.2015,17:45", "29.01.2015,15:30", "13.01.2015,...
## $ GPS_X     <int> 7460247, 7458383, 7461071, 7457669, 7454774, 7459982...
## $ GPS_Y     <int> 4957951, 4961042, 4963801, 4963440, 4962700, 4959325...
## $ VRSTA_NEZ <chr> "Sa povredjenim", "Sa mat.stetom", "Sa povredjenim",...
## $ NAZIV_TIP <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ NAZIV_DET <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ _COL8     <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ _COL9     <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ WGS_X     <dbl> 20.49236, 20.46856, 20.50232, 20.45934, 20.42281, 20...
## $ WGS_Y     <dbl> 44.76500, 44.79271, 44.81769, 44.81425, 44.80741, 44...

What are we looking at:

bgdMap <- get_map(location = 'Belgrade',
                  maptype = "roadmap")
ggmap(bgdMap, 
      extent = "device") + 
  geom_point(data = rawData,
             aes(x = WGS_X, y = WGS_Y), 
             size = .25, 
             color = "blue",
             alpha = .35) 
## Warning: Removed 220 rows containing missing values (geom_point).

With 220 points falling outside of the Belgrade Google Map area, we are already certain that some measurement error is present in the geolocation data. We'll take care of it later.

Let’s take a quick look at the city core in a 2D density plot with geom_density2d from {ggplot2}:

bgdMap <- get_map(location = 'Kneza Milosa, Belgrade',
                  maptype = "roadmap",
                  zoom = 16)
ggmap(bgdMap, 
      extent = "device") + 
  geom_density2d(data = rawData,
                 aes(x = WGS_X, y = WGS_Y),
                 size = .1) +
  stat_density2d(data = rawData,
                 aes(x = WGS_X, y = WGS_Y, 
                     fill = ..level.., 
                     alpha = ..level..), 
                 size = 0.01, 
                 bins = 16, 
                 geom = "polygon") + 
  scale_fill_gradient(low = "white", high = "blue") + 
  scale_alpha(range = c(0, 0.3), guide = FALSE) + 
  theme(legend.position="none")
## Warning: Removed 12341 rows containing non-finite values (stat_density2d).

## Warning: Removed 12341 rows containing non-finite values (stat_density2d).

bgdMap <- get_map(location = 'Mostar, Belgrade',
                  maptype = "roadmap",
                  zoom = 16)
ggmap(bgdMap, 
      extent = "device") + 
  geom_density2d(data = rawData,
                 aes(x = WGS_X, y = WGS_Y),
                 size = .1) +
  stat_density2d(data = rawData, 
                 aes(x = WGS_X, y = WGS_Y, 
                     fill = ..level.., 
                     alpha = ..level..), 
                 size = 0.01, 
                 bins = 16, 
                 geom = "polygon") + 
  scale_fill_gradient(low = "white", high = "blue") + 
  scale_alpha(range = c(0, 0.3), guide = FALSE) +
  theme(legend.position="none")
## Warning: Removed 12563 rows containing non-finite values (stat_density2d).

## Warning: Removed 12563 rows containing non-finite values (stat_density2d).

Let’s start figuring out the variables:

# - VRSTA_NEZ
unique(rawData$VRSTA_NEZ)
## [1] "Sa povredjenim" "Sa mat.stetom"  "Sa poginulim"
table(rawData$VRSTA_NEZ)
## 
##  Sa mat.stetom   Sa poginulim Sa povredjenim 
##           9484             80           3309
# - NAZIV_TIP
unique(rawData$NAZIV_TIP)
## [1] ""                                                  
## [2] "SN SA NAJMANjE DVA VOZILA ??? SKRETANjE ILI PRELAZ"
## [3] "SN SA NAJMANjE DVA VOZILA ??? BEZ SKRETANjA"       
## [4] "SN SA JEDNIM VOZILOM"                              
## [5] "SN SA PARKIRANIM VOZILIMA"                         
## [6] "SN SA PE??ACIMA"
table(rawData$NAZIV_TIP)
## 
##                                                    
##                                              12634 
##                               SN SA JEDNIM VOZILOM 
##                                                 38 
##        SN SA NAJMANjE DVA VOZILA ??? BEZ SKRETANjA 
##                                                 89 
## SN SA NAJMANjE DVA VOZILA ??? SKRETANjE ILI PRELAZ 
##                                                 62 
##                          SN SA PARKIRANIM VOZILIMA 
##                                                 37 
##                                    SN SA PE??ACIMA 
##                                                 13
# - NAZIV_DET
head(unique(rawData$NAZIV_DET))
## [1] ""                                                                                                            
## [2] "Najmanje dva vozila koja se kre??u istim putem u suprotnim smerovima uz skretanje ulevo ispred drugog vozila"
## [3] "Ostale nezgode sa u??e????em najmanje dva vozila bez skretanja (bez podataka o smeru)"                       
## [4] "Ostale nezgode sa najmanje dva vozila koja se kre??u istim putem u istom smeru uz skretanje"                 
## [5] "Ostale nezgode sa najmanje dva vozila ??? suprotni smerovi bez skretanja"                                    
## [6] "Najmanje dva vozila koja se kre??u u istom smeru ??? sustizanje"
# - NAZIV_TIP vs. VRSTA_NEZ
kable(table(rawData$VRSTA_NEZ, rawData$NAZIV_TIP))
|                |      | SN SA JEDNIM VOZILOM | SN SA NAJMANjE DVA VOZILA ??? BEZ SKRETANjA | SN SA NAJMANjE DVA VOZILA ??? SKRETANjE ILI PRELAZ | SN SA PARKIRANIM VOZILIMA | SN SA PE??ACIMA |
|----------------|-----:|---------------------:|--------------------------------------------:|---------------------------------------------------:|--------------------------:|----------------:|
| Sa mat.stetom  | 9307 |                   27 |                                           62 |                                                  52 |                        36 |               0 |
| Sa poginulim   |   76 |                    1 |                                            2 |                                                   0 |                         0 |               1 |
| Sa povredjenim | 3251 |                   10 |                                           25 |                                                  10 |                         1 |              12 |

2. Data Wrangling: cleaning this up + some re-structuring for future use

Separate Date from Time:

# - Separate Date and Time
rawData <- separate(data = rawData,
                    col = VREME_NEZ,
                    into = c('Date', 'Time'),
                    sep = ',',
                    remove = F)

Separate Date to Day, Month, and Year:

# - Date --> Day, Month, Year
rawData <- separate(data = rawData,
                    col = Date,
                    into = c('Day', 'Month', 'Year'),
                    sep = '\\.',
                    remove = F)

Separate Time to Hour, Minute:

# - Time --> Hour, Minute
rawData <- separate(data = rawData,
                    col = Time,
                    into = c('Hour', 'Minute'),
                    sep = ':',
                    remove = F)

Add a new Date-Time format:

# - New Date-Time format:
rawData$DateTime <- paste(
  paste(rawData$Month,
        rawData$Day,
        rawData$Year,
        sep = '/'),
  paste(rawData$Hour,
        rawData$Minute,
        sep = ":"),
  sep = ", "
)

And check:

# - Check Data Set:
kable(table(rawData$Year, rawData$Month))
|      |   01 |   02 |   03 |   04 |   05 |   06 |   07 |   08 |   09 |   10 |   11 | 12 |
|------|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|---:|
| 2013 |    0 |    0 |    0 |    0 |    0 |    0 |    0 |    1 |    0 |    0 |    0 |  0 |
| 2014 |    0 |    1 |    0 |    1 |    1 |    0 |    1 |    0 |    0 |    1 |    0 |  0 |
| 2015 | 1042 | 1063 | 1168 | 1100 | 1206 | 1158 | 1104 | 1119 | 1311 | 1417 | 1178 |  1 |

There are some data points labeled '2013' and '2014'; these are probably errors; correct them:

# - Year 2013, 2014: probably errors; correct
rawData[rawData$Year == '2014', 'Year'] <- '2015'
rawData[rawData$Year == '2013', 'Year'] <- '2015'

# - Check Data Set:
kable(table(rawData$Year, rawData$Month))
|      |   01 |   02 |   03 |   04 |   05 |   06 |   07 |   08 |   09 |   10 |   11 | 12 |
|------|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|---:|
| 2015 | 1042 | 1064 | 1168 | 1101 | 1207 | 1158 | 1105 | 1120 | 1311 | 1418 | 1178 |  1 |

Very few data points for December; get rid of it:

# - December --> too few data points; drop:
rawData <- rawData[-which(rawData$Month == '12'), ]

And check:

# - Check Data Set:
kable(table(rawData$Year, rawData$Month))
|      |   01 |   02 |   03 |   04 |   05 |   06 |   07 |   08 |   09 |   10 |   11 |
|------|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
| 2015 | 1042 | 1064 | 1168 | 1101 | 1207 | 1158 | 1105 | 1120 | 1311 | 1418 | 1178 |

What’s this:

# - Outcome: from VRSTA_NEZ
unique(rawData$VRSTA_NEZ)
## [1] "Sa povredjenim" "Sa mat.stetom"  "Sa poginulim"

Recode:

rawData$Outcome <- factor(rawData$VRSTA_NEZ)
levels(rawData$Outcome) <- c('Damage', 'Death','Injury')
levels(rawData$Outcome)
## [1] "Damage" "Death"  "Injury"

Turn into an ordered factor (possible use: ordinal logistic)

# - Outcome as ordered factor
rawData$Outcome <- factor(rawData$Outcome, 
                       levels = c('Damage', 'Injury', 'Death'),
                       ordered = T)

What is this + recode:

# - Type: from NAZIV_TIP
unique(rawData$NAZIV_TIP)
## [1] ""                                                  
## [2] "SN SA NAJMANjE DVA VOZILA ??? SKRETANjE ILI PRELAZ"
## [3] "SN SA NAJMANjE DVA VOZILA ??? BEZ SKRETANjA"       
## [4] "SN SA JEDNIM VOZILOM"                              
## [5] "SN SA PARKIRANIM VOZILIMA"                         
## [6] "SN SA PE??ACIMA"
rawData$Type <- factor(rawData$NAZIV_TIP)
levels(rawData$Type) <- c('None', 
                          '2VehTurnOrCross',
                          '2VehNoTurn',
                          '1Veh',
                          'parkedVeh',
                          'pedestrian')
levels(rawData$Type)
## [1] "None"            "2VehTurnOrCross" "2VehNoTurn"      "1Veh"           
## [5] "parkedVeh"       "pedestrian"

How many out of 12K+ recorded accidents were categorized:

# - how many accidents were categorized?
length(rawData$Type) - sum(rawData$Type == 'None')
## [1] 239

Geography:

# - Lat, Lon
rawData$Lat <- rawData$WGS_Y
rawData$Lon <- rawData$WGS_X

# - Check Geographical Data
# - Belgrade is on: Coordinates: 44°49′N 20°28′E
# - source: https://en.wikipedia.org/wiki/Belgrade
range(rawData$Lon)
## [1] 19.23912 22.80758
range(rawData$Lat)
## [1] 42.02545 45.11274

Is this cool? Probably not. Keep only data points between \(-3SD\) and \(+3SD\). N.B. This is not the right procedure to clean up the data set. The correct procedure would involve having a vector map of Belgrade and then keeping only the data points strictly found within the administratively defined region of interest.
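For illustration only, here is a minimal base-R sketch of the polygon-based approach just described. No official vector map of Belgrade is bundled here, so the `cityX`/`cityY` rectangle below is a purely hypothetical boundary; a real clean-up would use an administrative boundary shapefile (e.g. via {sf}):

```r
# - minimal ray-casting point-in-polygon test (toy sketch)
pointInPolygon <- function(px, py, polyX, polyY) {
  n <- length(polyX)
  inside <- FALSE
  j <- n
  for (i in 1:n) {
    # - does the horizontal ray from (px, py) cross edge (j, i)?
    crosses <- ((polyY[i] > py) != (polyY[j] > py)) &&
      (px < (polyX[j] - polyX[i]) * (py - polyY[i]) /
              (polyY[j] - polyY[i]) + polyX[i])
    if (crosses) inside <- !inside
    j <- i
  }
  inside
}
# - hypothetical rectangular 'boundary' around central Belgrade
cityX <- c(20.30, 20.60, 20.60, 20.30)
cityY <- c(44.70, 44.70, 44.90, 44.90)
pointInPolygon(20.49, 44.76, cityX, cityY)  # a downtown point -> TRUE
pointInPolygon(19.24, 42.03, cityX, cityY)  # an outlier point -> FALSE
```

With a real boundary polygon, the filter would simply keep the rows for which the test returns TRUE.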

# - filter
rawData <- rawData %>% 
  filter (Lon < mean(Lon) + 3*sd(Lon), 
          Lon > mean(Lon) -3*sd(Lon), 
          Lat < mean(Lat) + 3*sd(Lat), 
          Lat > mean(Lat) - 3*sd(Lat))

Sort:

# - Sort Data Set
rawData <- rawData[order(rawData$Day, 
                         rawData$Month,
                         rawData$Hour,
                         rawData$Minute), ]

dataSet will be the working data set:

### --- Working Data Set
dataSet <- rawData[, c('DateTime', 'Day', 'Month', 'Year',
                       'Time', 'Hour', 'Minute', 'Lat', 'Lon',
                       'Type', 'Outcome')]
rm(rawData)

Save:

### --- Save Working Data Set
### --- Working Directory
wDir <- '../TrafficAccidentsBGD2015'
setwd(wDir)
write.csv(dataSet, file = 'MUP2015_TrafficAccidentsBGD.csv')

3. Exploratory Data Analysis

Now, let’s visualize properly w. {leaflet}:

Accident Density

The density of traffic accidents is dynamically represented by markers in the Leaflet map. Click or zoom in to discover fine-grained information.

library(leaflet)
# - add an HTML descriptive column to dataSet
dataSet$Description <- paste(
  '<b>Date: </b>', paste(dataSet$Month, dataSet$Day, dataSet$Year, sep = "/"), '<br>',
  '<b>Time: </b>', paste(dataSet$Hour, dataSet$Minute, sep = ":"), '<br>',
  '<b>Outcome: </b>', dataSet$Outcome,
  sep = '')

accMap <- leaflet() %>%
  addTiles() %>%
  addMarkers(lng = dataSet$Lon,
             lat= dataSet$Lat,
             popup = dataSet$Description, 
             clusterOptions = markerClusterOptions())
accMap

Accident Severity

The severity of the traffic accident outcomes is represented by marker size and color: yellow stands for 'Damage', orange for 'Injury', red for 'Death'. Click or zoom in to discover fine-grained information.

# - Add Outcome Color
dataSet$OutcomeColor <- dataSet$Outcome
dataSet$OutcomeColor <- recode(dataSet$OutcomeColor, 
                               Damage = 'Yellow',
                               Injury = 'DarkOrange',
                               Death = 'Red')
# - Add Outcome Warning (Marker Size)
dataSet$OutcomeWarning <- dataSet$Outcome
dataSet$OutcomeWarning <- recode(dataSet$OutcomeWarning, 
                                 Damage = '10',
                                 Injury = '15',
                                 Death = '20')
dataSet$OutcomeWarning <- as.numeric(dataSet$OutcomeWarning)
  
accMap <- leaflet() %>%
  addTiles(urlTemplate = "http://{s}.tiles.wmflabs.org/bw-mapnik/{z}/{x}/{y}.png", 
           attribution = '&copy; <a href="http://www.openstreetmap.org/copyright">OpenStreetMap</a>') %>% 
  addCircleMarkers(lng = dataSet$Lon,
                   lat= dataSet$Lat,
                   radius = dataSet$OutcomeWarning*2,
                   popup = dataSet$Description,
                   fill = T,
                   fillColor = dataSet$OutcomeColor,
                   stroke = F,
                   fillOpacity = .5)
accMap

What was the probability of a traffic accident resulting in injury in Belgrade, 2015? N.B. That is the conditional probability, \(P(Injury|Road Accident)\):

outcomes <- table(dataSet$Outcome)
pInjury <- outcomes['Injury']/sum(outcomes)
pInjury
##  Injury 
## 0.25296

That is some 25%:

pInjury*100
## Injury 
## 25.296

What was the probability of a traffic accident with fatal consequences in Belgrade, 2015? N.B. That is the conditional probability, \(P(Fatal|Road Accident)\):

outcomes <- table(dataSet$Outcome)
pDeath <- unname(outcomes['Death']/sum(outcomes))
pDeath
## [1] 0.00488

which is less than a half percent:

pDeath*100
## [1] 0.488

Reminder: We do not know the probability of being involved in a traffic accident in the first place, so do not rush to attach emotional responses to these probabilities…
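To make the point concrete, here is a toy computation with a purely hypothetical base rate (the 1% figure below is invented, not estimated from any data):

```r
# - P(Injury) = P(Injury | Accident) * P(Accident); the conditional
# - probability alone says nothing about the unconditional risk
pInjuryGivenAccident <- 0.25296  # estimated above
pAccident <- 0.01                # hypothetical base rate, NOT from the data
pInjuryGivenAccident * pAccident
## [1] 0.0025296
```

The headline "25%" shrinks to a fraction of a percent once an (unknown) accident base rate enters the picture.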

Accidents by months:

byMonth <- dataSet %>%
  group_by(Month) %>%
  summarise(Accidents = n())
byMonth$Month <- as.numeric(byMonth$Month)
# byMonth$Month <- factor(byMonth$Month, levels <- byMonth$Month)
ggplot(byMonth, aes(x = Month, y = Accidents)) + 
  geom_line(color = "blue", size = .5) + 
  geom_point(color = "blue", size = 1.5) + 
  geom_point(color = "white", size = 1) + 
  scale_x_continuous(breaks = byMonth$Month,
                   labels = month.abb[as.numeric(byMonth$Month)]) +
  ggtitle('Belgrade, 2015: Traffic Accidents by Month') +
  ylim(1000, max(byMonth$Accidents)+100) +
  theme_bw() +
  theme(plot.title = element_text(size = 11))

Accidents by hour:

byHour <- dataSet %>%
  group_by(Hour) %>%
  summarise(Accidents = n())
byHour$HourData <- as.numeric(byHour$Hour)
ggplot(byHour, aes(y = Accidents, x = HourData)) + 
  geom_line(color = "blue", size = .5) + 
  geom_point(color = "blue", size = 1.5) + 
  geom_point(color = "white", size = 1) + 
  scale_x_continuous(breaks = byHour$HourData-.5,
                   labels = byHour$Hour) +
  ggtitle('Belgrade, 2015: Traffic Accidents by Hour') + 
  xlab('Hour') +
  theme_bw() +
  theme(plot.title = element_text(size = 11))

Accident outcome vs. Month:

dataSet$MonthLabel <- month.abb[as.numeric(dataSet$Month)]
plot(table(dataSet$MonthLabel, dataSet$Outcome),
     main = 'Accident Outcomes per Month',
     col = "gold")

No pattern seems to be present:

testAccMonth <- as.matrix(table(dataSet$MonthLabel, dataSet$Outcome))
kable(testAccMonth)
|     | Damage | Injury | Death |
|-----|-------:|-------:|------:|
| Apr |    816 |    265 |     3 |
| Aug |    788 |    276 |     4 |
| Feb |    804 |    243 |     6 |
| Jan |    767 |    256 |     5 |
| Jul |    782 |    283 |     7 |
| Jun |    802 |    322 |     6 |
| Mar |    859 |    298 |     3 |
| May |    859 |    311 |     8 |
| Nov |    870 |    252 |     4 |
| Oct |   1017 |    333 |     4 |
| Sep |    913 |    323 |    11 |

To confirm, \({\chi}^2\)-test for contingency tables:

chisq.test(testAccMonth, 
           simulate.p.value = T)
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  testAccMonth
## X-squared = 26.791, df = NA, p-value = 0.1354

When do the fatal accidents take place?

fatalSet <- dataSet %>% 
  filter(Outcome == 'Death') %>% 
  group_by(Hour) %>% 
  summarise(Accidents = n()) %>% 
  t()
colnames(fatalSet) <- fatalSet[1,]
fatalSet <- fatalSet[-1,]
kable(t(fatalSet), caption = 'Fatal Accidents by Hour:')
Fatal Accidents by Hour:

| 00 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 21 | 22 | 23 |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
|  3 |  1 |  4 |  2 |  3 |  1 |  3 |  4 |  2 |  3 |  3 |  6 |  2 |  3 |  2 |  8 |  1 |  1 |  2 |  4 |  3 |

Is there a pattern in the hourly distribution of the accident outcome?

byHourSet <- dataSet %>%
  group_by(Hour, Outcome) %>% 
  tally() %>%
  mutate(Percent = round(n/sum(n)*100,2))
ggplot(data = byHourSet, 
       aes(x = Hour, y = Percent, color = Outcome, fill = Outcome)) +
  geom_bar(stat = "identity", position = "stack", width = .5) + 
  scale_fill_brewer(palette = "Blues") +
  scale_color_brewer(palette = "Blues") +
  theme_bw()

Any indication of a relationship? We already know that there will be missing data for fatal accidents:

byHourSet <- dataSet %>%
  filter(!(Outcome == 'Death')) %>% 
  group_by(Hour, Outcome) %>% 
  tally() %>%
  spread(key = Outcome, value = n) %>% 
  ungroup() %>% 
  dplyr::select(-Hour)
chisq.test(byHourSet, 
           simulate.p.value = T)
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  byHourSet
## X-squared = 49.848, df = NA, p-value = 0.0004998

Hm.

byHourSet <- dataSet %>%
  group_by(Hour, Outcome) %>% 
  tally() %>%
  mutate(Percent = round(n/sum(n)*100,2)) %>%
  filter(Outcome == 'Injury')
ggplot(data = byHourSet, 
       aes(x = as.numeric(Hour), y = Percent)) +
  geom_path(color = 'blue') +
  geom_point(size = 1.5, color = 'blue') +
  geom_point(size = 1, color = 'white') +
  xlab('Hour') + ylim(20,35) +
  ggtitle('Percent of Accidents w. Injuries per Hour') +
  scale_x_continuous(breaks = as.numeric(byHourSet$Hour),
                     labels = as.character(0:23)) +
  theme_bw()

Let’s inspect the distribution of traffic accidents per day:

dataSet <- unite(data = dataSet,
                 col = Date,
                 Year, Month, Day,
                 sep = "-")

Check whether there are any days with no accidents, other than in December 2015, for which we have no data:

distData <- dataSet %>% 
  group_by(Date) %>%
  summarise(Frequency = n())
distData$Date <- as.Date(distData$Date)
dates2015 <- seq(as.Date("2015/1/1"), as.Date("2015/12/31"), "days")
w <- which(dates2015 %in% distData$Date)
obs <- rep(0, length(dates2015))
obs[w] <- distData$Frequency
distData <- data.frame(
  Date = dates2015,
  Frequency = obs)
length(which(distData$Frequency == 0)) # December data, right
## [1] 31
which(distData$Frequency == 0) # December data, right
##  [1] 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351
## [18] 352 353 354 355 356 357 358 359 360 361 362 363 364 365
distData <- distData[distData$Frequency > 0, ]
ggplot(distData, aes(x = Frequency)) +
  geom_bar(stat = 'count', fill = "blue", alpha = .35, width = .5) +
  xlab("Number of Accidents Per Day") + ylab("Frequency") + 
  ggtitle('Belgrade, 2015: Traffic Accidents Daily Frequency') +
  theme_bw()

Could this be a Poisson distribution?

sampleStats <- lapply(1:10000, function(x) {
  l <- list()
  s <- sample(distData$Frequency, 50, replace = T)
  l$mean <- mean(s)
  l$var <- var(s)
  l
})
meanStats <- sapply(sampleStats, function(x) x$mean)
varStats <- sapply(sampleStats, function(x) x$var)
poissonFrame <- data.frame(mean = meanStats,
                           var = varStats)
ggplot(poissonFrame, aes(x = mean, y = var)) +
  geom_point(size = .1, color = "black") +
  geom_smooth(method = "lm", size = .25, se = T, color = "blue", alpha = .35) +
  theme_bw()

I don’t think so.
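Another quick way to see this is the equidispersion property: for a Poisson variable the variance equals the mean, so their ratio should hover around 1. A self-contained sketch on simulated data (since overdispersion is exactly what we suspect here):

```r
# - for Poisson data, var/mean ~ 1; a ratio well above 1 signals overdispersion
set.seed(1)
poisSample <- rpois(1000, lambda = 38)
var(poisSample) / mean(poisSample)    # close to 1
nbSample <- rnbinom(1000, size = 5, mu = 38)
var(nbSample) / mean(nbSample)        # well above 1: overdispersed
# - the same ratio computed on distData$Frequency makes the overdispersion
# - in the daily accident counts explicit
```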

4. Data Enrichment

Is there anything that could help us predict the number of accidents per day? Here's a proposal: let's grab the weather data for Belgrade, 2015/01/01 - 2015/11/30, and see if there's anything potentially useful in the data set:

(NOTE: bgdWeather2015.csv was obtained from the following lines that do not evaluate in this version of the notebook; the data are already prepared and stored locally.)

library(weatherData)
checkBGD <- checkSummarizedDataAvailability(station_id = 'LYBE', 
                                            start_date = '2015-01-01', 
                                            end_date = '2015-11-30',
                                            station_type = 'id')
## Retrieving from: http://www.wunderground.com/history/airport/LYBE/2015/1/1/CustomHistory.html?dayend=30&monthend=11&yearend=2015&req_city=NA&req_state=NA&req_statename=NA&format=1
## URL does not seem to exist: http://www.wunderground.com/history/airport/LYBE/2015/1/1/CustomHistory.html?dayend=30&monthend=11&yearend=2015&req_city=NA&req_state=NA&req_statename=NA&format=1
## The original error message:
## Error in readLines(u): cannot open the connection

After checking the reported URL in the browser, it turns out the data are there in fact:

bgd2015URL <- 'https://www.wunderground.com/history/airport/LYBE/2015/1/1/CustomHistory.html?dayend=30&monthend=11&yearend=2015&req_city=NA&req_state=NA&req_statename=NA&format=1'
bgdWeather2015 <- read.csv(bgd2015URL,
                           header  = T,
                           stringsAsFactors = F,
                           check.names = F)
# clean-up a bit:
colnames(bgdWeather2015) <- gsub("<br />","", colnames(bgdWeather2015), fixed = T)
bgdWeather2015$WindDirDegrees <- gsub("<br />","", bgdWeather2015$WindDirDegrees, fixed = T)
kable(head(bgdWeather2015))

Let's clean this up, starting with CloudCover:

table(bgdWeather2015$CloudCover)
bgdWeather2015$CloudCover[which(is.na(bgdWeather2015$CloudCover))] <- 0

Events:

table(bgdWeather2015$Events)
bgdWeather2015$Events[!grepl("^[[:alpha:]]", bgdWeather2015$Events)] <- 'None'

Max Gust SpeedMPH goes to Yes/No:

table(bgdWeather2015$`Max Gust SpeedMPH`)
bgdWeather2015$WindGust <- ifelse(is.na(bgdWeather2015$`Max Gust SpeedMPH`), 'No', 'Yes')

Fix the CET column to match our Date (YYYY-MM-DD format):

bgdWeather2015$CET <- sapply(bgdWeather2015$CET, function(x) {
  x <- strsplit(x, split = '-', fixed = T)
  if (nchar(x[[1]][2])<2) {
    x[[1]][2] <- paste0('0', x[[1]][2])
  }
  if (nchar(x[[1]][3])<2) {
    x[[1]][3] <- paste0('0', x[[1]][3])
  }
  return(paste(unlist(x), collapse = "-"))
})
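As a side note, the same zero-padding can be had more compactly by round-tripping through Date, since as.Date() accepts single-digit month and day components; a small sketch (assuming the CET strings look like "2015-1-1"):

```r
# - as.Date() parses non-padded components; format() re-emits them zero-padded
x <- c("2015-1-1", "2015-11-30")
format(as.Date(x, format = "%Y-%m-%d"), "%Y-%m-%d")
## [1] "2015-01-01" "2015-11-30"
```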

And save:

write.csv(bgdWeather2015, 'bgdWeather2015.csv')

(NOTE: loading bgdWeather2015.csv here). Now left join bgdWeather2015 to distData and prepare for modeling:

bgdWeather2015 <- read.csv('bgdWeather2015.csv',
                           header = T,
                           stringsAsFactors = F,
                           check.names = F,
                           row.names = 1)
bgdWeather2015$CET <- as.Date(bgdWeather2015$CET)
modelSet <- left_join(distData, bgdWeather2015,
                      by = c("Date" = "CET"))
modelSet$`Max Gust SpeedMPH` <- NULL
modelSet$Events <- factor(modelSet$Events)
levels(modelSet$Events)
##  [1] "Fog"                    "Fog-Rain"              
##  [3] "Fog-Rain-Thunderstorm"  "None"                  
##  [5] "Rain"                   "Rain-Hail-Thunderstorm"
##  [7] "Rain-Snow"              "Rain-Thunderstorm"     
##  [9] "Snow"                   "Thunderstorm"
# - N.B. use factor(levels = ...) to reorder; levels(x) <- would rename instead
modelSet$Events <- factor(modelSet$Events,
                          levels = c('None', setdiff(levels(modelSet$Events), 'None')))
modelSet$WindGust <- factor(modelSet$WindGust, levels = c('No', 'Yes'))

Remove spaces in column names:

colnames(modelSet) <- gsub("\\s+", "", colnames(modelSet))

Fix $WindDirDegrees which is a character:

modelSet$WindDirDegrees <- as.numeric(modelSet$WindDirDegrees)

and save:

write.csv(modelSet, 'modelDataSet.csv')

5. Modeling

Add some new features:

modelSet$DaySeverity <- cut(modelSet$Frequency,
                            breaks = quantile(modelSet$Frequency),
                            right = T,
                            include.lowest = T,
                            labels = as.numeric(1:4))
table(modelSet$DaySeverity)
## 
##  1  2  3  4 
## 89 85 81 79

modelSet$DaySeverity becomes an ordered factor:

modelSet$DaySeverity <- factor(modelSet$DaySeverity, ordered = T)
tail(modelSet$DaySeverity)
## [1] 1 4 2 1 1 1
## Levels: 1 < 2 < 3 < 4

Introduce DayofWeek feature:

# - N.B. assumes an English locale; set the levels explicitly, since
# - levels(x) <- would rename the alphabetically ordered levels, not reorder them
modelSet$DayofWeek <- factor(weekdays(modelSet$Date),
                             levels = c('Monday', 'Tuesday', 'Wednesday',
                                        'Thursday', 'Friday', 'Saturday',
                                        'Sunday'))

Inspect the effect of DayofWeek:

dayFrame <- modelSet %>% 
  group_by(DayofWeek) %>%
  summarise(Frequency = sum(Frequency)) %>% 
  arrange(desc(Frequency))
kable(dayFrame)
| DayofWeek | Frequency |
|-----------|----------:|
| Monday    |      2055 |
| Friday    |      1982 |
| Tuesday   |      1909 |
| Saturday  |      1882 |
| Sunday    |      1827 |
| Wednesday |      1607 |
| Thursday  |      1238 |

Relevel DayofWeek:

modelSet$DayofWeek <- factor(modelSet$DayofWeek, 
                             levels = dayFrame$DayofWeek[order(dayFrame$Frequency)])
levels(modelSet$DayofWeek)
## [1] "Thursday"  "Wednesday" "Sunday"    "Saturday"  "Tuesday"   "Friday"   
## [7] "Monday"

Inspect Events:

eventFrame <- modelSet %>%
  group_by(Events) %>%
  summarise(Frequency = sum(Frequency)) %>% 
  arrange(desc(Frequency))
kable(eventFrame)
| Events                 | Frequency |
|------------------------|----------:|
| Fog-Rain-Thunderstorm  |      6079 |
| Rain                   |      2850 |
| None                   |      1312 |
| Rain-Thunderstorm      |       705 |
| Fog                    |       535 |
| Rain-Snow              |       424 |
| Snow                   |       248 |
| Thunderstorm           |       190 |
| Fog-Rain               |        80 |
| Rain-Hail-Thunderstorm |        77 |

Relevel Events:

modelSet$Events <- factor(modelSet$Events,
                          levels = eventFrame$Events[order(eventFrame$Frequency)])
levels(modelSet$Events)
##  [1] "Rain-Hail-Thunderstorm" "Fog-Rain"              
##  [3] "Thunderstorm"           "Snow"                  
##  [5] "Rain-Snow"              "Fog"                   
##  [7] "Rain-Thunderstorm"      "None"                  
##  [9] "Rain"                   "Fog-Rain-Thunderstorm"

5A. GLM: Negative Binomial Regression: Accident Frequency

# - model: Poisson GLM
mData <- modelSet %>% 
  dplyr::select(-Date, -starts_with("Max"), -starts_with("Min"), -DaySeverity)
mData %>% 
  group_by(DayofWeek) %>%
  summarise(Mean = round(mean(Frequency),2), 
            Var = round(var(Frequency),2)) %>%
  kable()
| DayofWeek |  Mean |    Var |
|-----------|------:|-------:|
| Thursday  | 25.79 |  48.17 |
| Wednesday | 33.48 |  65.49 |
| Sunday    | 38.87 |  77.98 |
| Saturday  | 40.04 |  58.00 |
| Tuesday   | 39.77 | 102.78 |
| Friday    | 41.29 | 113.53 |
| Monday    | 42.81 |  86.67 |

Overdispersion present; use Negative Binomial Regression in place of Poisson for count data:

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
mFit <- glm.nb(Frequency ~ .,
               data = mData)
# - inspect model
summary(mFit)
## 
## Call:
## glm.nb(formula = Frequency ~ ., data = mData, init.theta = 43.86777439, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.4603  -0.6858  -0.0353   0.5739   4.0627  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  4.2056174  2.0956883   2.007  0.04477 *  
## MeanTemperatureF             0.0008798  0.0083022   0.106  0.91560    
## MeanDewPointF                0.0026903  0.0090130   0.298  0.76533    
## MeanHumidity                 0.0041437  0.0040686   1.018  0.30845    
## MeanSeaLevelPressureIn      -0.0481397  0.0673899  -0.714  0.47501    
## MeanVisibilityMiles         -0.0044484  0.0089429  -0.497  0.61889    
## MeanWindSpeedMPH             0.0091200  0.0044751   2.038  0.04156 *  
## PrecipitationIn             -0.0236448  0.0084898  -2.785  0.00535 ** 
## CloudCover                  -0.0172911  0.0103561  -1.670  0.09499 .  
## EventsFog-Rain               0.3019344  0.2255759   1.339  0.18073    
## EventsThunderstorm           0.0888676  0.1906498   0.466  0.64112    
## EventsSnow                   0.2250424  0.1870251   1.203  0.22887    
## EventsRain-Snow              0.1188256  0.1788000   0.665  0.50632    
## EventsFog                    0.0398455  0.1722310   0.231  0.81704    
## EventsRain-Thunderstorm      0.0170324  0.1675080   0.102  0.91901    
## EventsNone                   0.1351256  0.1652182   0.818  0.41344    
## EventsRain                   0.0577351  0.1615937   0.357  0.72088    
## EventsFog-Rain-Thunderstorm  0.0620189  0.1620782   0.383  0.70198    
## WindDirDegrees              -0.0001668  0.0001361  -1.225  0.22054    
## WindGustYes                  0.0527579  0.0517900   1.019  0.30835    
## DayofWeekWednesday           0.2563737  0.0496790   5.161 2.46e-07 ***
## DayofWeekSunday              0.4164511  0.0493509   8.439  < 2e-16 ***
## DayofWeekSaturday            0.4532986  0.0488916   9.271  < 2e-16 ***
## DayofWeekTuesday             0.4433796  0.0494880   8.959  < 2e-16 ***
## DayofWeekFriday              0.4662585  0.0490481   9.506  < 2e-16 ***
## DayofWeekMonday              0.5058560  0.0487037  10.386  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(43.8678) family taken to be 1)
## 
##     Null deviance: 546.63  on 333  degrees of freedom
## Residual deviance: 344.91  on 308  degrees of freedom
## AIC: 2413.7
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  43.87 
##           Std. Err.:  7.56 
## 
##  2 x log-likelihood:  -2359.659

Predict:

predictFrame <- data.frame(Observation = rep(1:length(mData$Frequency), 2),
                           Value = c(predict(mFit, type = "response"),
                                     mData$Frequency),
                           Type = c(rep("Predicted", length(mData$Frequency)),
                                    rep("Observed", length(mData$Frequency))),
                           stringsAsFactors = F)
ggplot(predictFrame, aes(x = Observation, y = Value, group = Type, color = Type)) + 
  geom_line(size = .15) + 
  scale_color_manual(values = c("lightblue", "black")) +
  theme_bw() + 
  theme(legend.key = element_blank()) +
  theme(legend.title = element_blank()) +
  ggtitle("GLM Negative Binomial: Predicted vs. Observed")

predictFrame <- spread(predictFrame,
                       key = Type,
                       value = Value)
ggplot(predictFrame, aes(x = Predicted, y = Observed)) + 
  geom_point(size = .5, color = "blue") + 
  theme_bw() + 
  ggtitle("GLM Negative Binomial: Predicted vs. Observed")
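To quantify the agreement visible in the plots, a squared correlation between predicted and observed frequencies can serve as a rough pseudo-R². A minimal, self-contained sketch; the counts below are simulated and merely stand in for `predictFrame$Predicted` and `predictFrame$Observed`:

```r
# - sketch: squared correlation as a rough predicted-vs-observed fit measure
# - simulated counts stand in for the real predictFrame columns
set.seed(1)
observed <- rnbinom(334, size = 44, mu = 25)   # hypothetical daily accident counts
predicted <- observed + rnorm(334, sd = 3)     # hypothetical model predictions
rSq <- cor(predicted, observed)^2
round(rSq, 2)
```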

# - coefficients
mCoeff <- as.data.frame(summary(mFit)$coefficients)
w <- which(mCoeff$`Pr(>|z|)` < .05)
rownames(mCoeff)[w]
## [1] "(Intercept)"        "MeanWindSpeedMPH"   "PrecipitationIn"   
## [4] "DayofWeekWednesday" "DayofWeekSunday"    "DayofWeekSaturday" 
## [7] "DayofWeekTuesday"   "DayofWeekFriday"    "DayofWeekMonday"
w <- setdiff(w, 1) # - suppress the intercept from output
important <- data.frame(predictor = rownames(mCoeff)[w],
                        coefficient = round(exp(mCoeff$Estimate)[w], 2),
                        stringsAsFactors = F)
kable(important[order(-important$coefficient), ])
|   |predictor          | coefficient|
|:--|:------------------|-----------:|
|8  |DayofWeekMonday    |        1.66|
|7  |DayofWeekFriday    |        1.59|
|5  |DayofWeekSaturday  |        1.57|
|6  |DayofWeekTuesday   |        1.56|
|4  |DayofWeekSunday    |        1.52|
|3  |DayofWeekWednesday |        1.29|
|1  |MeanWindSpeedMPH   |        1.01|
|2  |PrecipitationIn    |        0.98|
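To read the table: the negative binomial model uses a log link, so exponentiating a coefficient yields a multiplicative rate ratio relative to the baseline level (Thursday is the day-of-week level absorbed into the intercept here). A quick arithmetic check of the Monday entry:

```r
# - under a log link, exp(coefficient) multiplies the expected count
coefMonday <- 0.5058560    # DayofWeekMonday estimate from the summary above
rateRatio <- exp(coefMonday)
round(rateRatio, 2)        # 1.66: ~66% more accidents expected than on Thursdays
```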

5B. Ordinal Logistic Regression: Day Severity (i.e. Frequency categorized)

mData <- modelSet %>% 
  dplyr::select(-Date, -starts_with("Max"), -starts_with("Min"), -Frequency)
library(ordinal)
## 
## Attaching package: 'ordinal'
## The following object is masked from 'package:dplyr':
## 
##     slice
# - model
mFit <- clm(DaySeverity ~ .,
            data = mData)
## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables? 
## In addition: Absolute and relative convergence criteria were met
# - inspect model
summary(mFit)
## formula: 
## DaySeverity ~ MeanTemperatureF + MeanDewPointF + MeanHumidity + MeanSeaLevelPressureIn + MeanVisibilityMiles + MeanWindSpeedMPH + PrecipitationIn + CloudCover + Events + WindDirDegrees + WindGust + DayofWeek
## data:    mData
## 
##  link  threshold nobs logLik  AIC    niter max.grad cond.H 
##  logit flexible  334  -395.53 847.06 5(0)  3.64e-12 4.6e+09
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## MeanTemperatureF             0.0511424  0.0700945   0.730 0.465622    
## MeanDewPointF               -0.0311045  0.0760710  -0.409 0.682623    
## MeanHumidity                 0.0627742  0.0349019   1.799 0.072083 .  
## MeanSeaLevelPressureIn       0.1399800  0.5724987   0.245 0.806838    
## MeanVisibilityMiles          0.0192045  0.0747324   0.257 0.797197    
## MeanWindSpeedMPH             0.0819241  0.0396896   2.064 0.039006 *  
## PrecipitationIn             -0.1654075  0.1238633  -1.335 0.181744    
## CloudCover                  -0.1876122  0.0900082  -2.084 0.037125 *  
## EventsFog-Rain               2.3429351  1.7404577   1.346 0.178251    
## EventsThunderstorm           1.1529011  1.5487647   0.744 0.456634    
## EventsSnow                   0.8726799  1.5059328   0.579 0.562255    
## EventsRain-Snow              0.7597486  1.4504593   0.524 0.600419    
## EventsFog                   -0.0775333  1.3813141  -0.056 0.955238    
## EventsRain-Thunderstorm      0.2210704  1.3414142   0.165 0.869098    
## EventsNone                   0.9661348  1.3125211   0.736 0.461675    
## EventsRain                   0.3228195  1.2845021   0.251 0.801568    
## EventsFog-Rain-Thunderstorm  0.3013201  1.2898518   0.234 0.815289    
## WindDirDegrees              -0.0008615  0.0011769  -0.732 0.464173    
## WindGustYes                  0.5221284  0.4314099   1.210 0.226170    
## DayofWeekWednesday           1.5261618  0.4421710   3.452 0.000557 ***
## DayofWeekSunday              2.9207413  0.4598398   6.352 2.13e-10 ***
## DayofWeekSaturday            3.2456627  0.4608758   7.042 1.89e-12 ***
## DayofWeekTuesday             2.9690228  0.4509963   6.583 4.60e-11 ***
## DayofWeekFriday              3.1912727  0.4556060   7.004 2.48e-12 ***
## DayofWeekMonday              3.6822758  0.4602831   8.000 1.24e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Threshold coefficients:
##     Estimate Std. Error z value
## 1|2    11.65      17.78   0.655
## 2|3    13.18      17.78   0.742
## 3|4    14.52      17.78   0.817

Refine:

mCoeff <- as.data.frame(summary(mFit)$coefficients)
w <- which(mCoeff$`Pr(>|z|)` < .05)
w <- setdiff(w, 1:3) # suppress the intercepts from the output
rownames(mCoeff)[w]
## [1] "MeanWindSpeedMPH"   "CloudCover"         "DayofWeekWednesday"
## [4] "DayofWeekSunday"    "DayofWeekSaturday"  "DayofWeekTuesday"  
## [7] "DayofWeekFriday"    "DayofWeekMonday"
mData <- modelSet %>% 
  dplyr::select(DaySeverity, MeanWindSpeedMPH, CloudCover, DayofWeek)
# - model
mFit <- clm(DaySeverity ~ .,
            data = mData)
# - inspect model
summary(mFit)
## formula: DaySeverity ~ MeanWindSpeedMPH + CloudCover + DayofWeek
## data:    mData
## 
##  link  threshold nobs logLik  AIC    niter max.grad cond.H 
##  logit flexible  334  -411.57 845.13 4(0)  3.88e-07 7.5e+03
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## MeanWindSpeedMPH   0.034220   0.031119   1.100 0.271493    
## CloudCover         0.004032   0.044188   0.091 0.927291    
## DayofWeekWednesday 1.558539   0.427319   3.647 0.000265 ***
## DayofWeekSunday    2.787103   0.439185   6.346 2.21e-10 ***
## DayofWeekSaturday  2.973880   0.439313   6.769 1.29e-11 ***
## DayofWeekTuesday   2.806974   0.430021   6.528 6.69e-11 ***
## DayofWeekFriday    3.077950   0.436535   7.051 1.78e-12 ***
## DayofWeekMonday    3.489330   0.442451   7.886 3.11e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Threshold coefficients:
##     Estimate Std. Error z value
## 1|2   1.4023     0.4004   3.502
## 2|3   2.8277     0.4213   6.712
## 3|4   4.0895     0.4385   9.326
mCoeff <- as.data.frame(summary(mFit)$coefficients)
w <- which(mCoeff$`Pr(>|z|)` < .05)
w <- setdiff(w, 1:3) # - suppress the intercepts from output
important <- data.frame(predictor = rownames(mCoeff)[w],
                        coefficient = round(exp(mCoeff$Estimate)[w], 2),
                        stringsAsFactors = F)
kable(important[order(-important$coefficient), ])
|   |predictor          | coefficient|
|:--|:------------------|-----------:|
|6  |DayofWeekMonday    |       32.76|
|5  |DayofWeekFriday    |       21.71|
|3  |DayofWeekSaturday  |       19.57|
|4  |DayofWeekTuesday   |       16.56|
|2  |DayofWeekSunday    |       16.23|
|1  |DayofWeekWednesday |        4.75|
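With a cumulative logit model, the threshold coefficients translate a linear predictor into category probabilities: P(DaySeverity ≤ k) = plogis(θ_k − η). A sketch for a hypothetical Monday; the coefficients and thresholds are copied from the refined clm() summary above, while the covariate values (wind 5 MPH, cloud cover 4) are arbitrary picks for illustration:

```r
# - sketch: cumulative severity probabilities for a hypothetical Monday
# - coefficients/thresholds copied from the refined clm() summary; covariates invented
eta <- 0.034220 * 5 + 0.004032 * 4 + 3.489330   # MeanWindSpeedMPH, CloudCover, Monday
thresholds <- c(1.4023, 2.8277, 4.0895)         # 1|2, 2|3, 3|4
cumProbs <- plogis(thresholds - eta)            # P(DaySeverity <= 1), <= 2, <= 3
round(cumProbs, 3)
```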
# - e.g. Hit rate (Accuracy of Classification)
countHits <- sum(
  as.numeric(predict(mFit, type = "class")$fit) == as.numeric(modelSet$DaySeverity))
dataPoints <- length(modelSet$DaySeverity)
hitRate <- countHits/dataPoints
paste0("Correct classification rate: ", round(hitRate*100, 2), "%")
## [1] "Correct classification rate: 40.42%"
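The hit rate compresses classification quality into one number; a confusion matrix shows which severity categories the model confuses. A minimal sketch, with simulated labels standing in for `predict(mFit, type = "class")$fit` and `modelSet$DaySeverity`:

```r
# - sketch: confusion matrix for ordinal class predictions
# - simulated labels stand in for the real predicted/observed severities
set.seed(1)
observed <- factor(sample(1:4, 334, replace = TRUE))
predicted <- factor(sample(1:4, 334, replace = TRUE), levels = levels(observed))
confMat <- table(Predicted = predicted, Observed = observed)
confMat
sum(diag(confMat)) / sum(confMat)   # overall hit rate recovered from the diagonal
```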

6. Air vs Road Transportation Safety

According to the Aviation Safety Network’s data, there were 16 fatal airliner accidents in 2015, resulting in 560 fatalities. Source: Aviation Safety Network.

How many air carrier departures took place in 2015? We will use the {wbstats} package to access the World Data Bank to find out. The primary data source for this indicator is the International Civil Aviation Organization; the indicator can be accessed from World Data Bank. The indicator code in World Data Bank is: IS.AIR.DPRT (Description: Registered carrier departures worldwide are domestic takeoffs and takeoffs abroad of air carriers registered in the country.).

# Indicator: IS.AIR.DPRT
# - access World Data Bank
library(wbstats)
airDeps <- data.frame(wb(
  indicator = 'IS.AIR.DPRT'
))
kable(head(airDeps))
|  |   value|date |indicatorID |indicator                                              |iso2c |country    |
|:-|-------:|:----|:-----------|:------------------------------------------------------|:-----|:----------|
|2 | 1342831|2015 |IS.AIR.DPRT |Air transport, registered carrier departures worldwide |1A    |Arab World |
|3 | 1292064|2014 |IS.AIR.DPRT |Air transport, registered carrier departures worldwide |1A    |Arab World |
|4 | 1222184|2013 |IS.AIR.DPRT |Air transport, registered carrier departures worldwide |1A    |Arab World |
|5 | 1159026|2012 |IS.AIR.DPRT |Air transport, registered carrier departures worldwide |1A    |Arab World |
|6 | 1093319|2011 |IS.AIR.DPRT |Air transport, registered carrier departures worldwide |1A    |Arab World |
|7 | 1065038|2010 |IS.AIR.DPRT |Air transport, registered carrier departures worldwide |1A    |Arab World |

Not all codes in the iso2c field represent individual countries: some are World Bank aggregates (regions, income groups, and the like), so summing across all of them would count the same departures more than once:

unique(airDeps$iso2c)
##   [1] "1A" "S3" "B8" "V2" "Z4" "4E" "T4" "XC" "Z7" "7E" "T7" "EU" "F1" "XE"
##  [15] "XD" "XF" "ZT" "XH" "XI" "XG" "V3" "ZJ" "XJ" "T2" "XL" "XO" "XM" "XN"
##  [29] "ZQ" "XQ" "T3" "XP" "XU" "OE" "S4" "S2" "V4" "V1" "S1" "8S" "T5" "ZG"
##  [43] "ZF" "T6" "XT" "1W" "AF" "AL" "DZ" "AS" "AO" "AG" "AR" "AM" "AU" "AT"
##  [57] "AZ" "BS" "BH" "BD" "BB" "BY" "BE" "BZ" "BJ" "BT" "BO" "BA" "BW" "BR"
##  [71] "BN" "BG" "BF" "BI" "CV" "KH" "CM" "CA" "CF" "TD" "CL" "CN" "CO" "KM"
##  [85] "CD" "CG" "CR" "CI" "HR" "CU" "CY" "CZ" "DK" "DJ" "DO" "EC" "EG" "SV"
##  [99] "GQ" "ER" "EE" "ET" "FJ" "FI" "FR" "GA" "GM" "GE" "DE" "GH" "GR" "GU"
## [113] "GT" "GN" "GW" "GY" "HT" "HN" "HK" "HU" "IS" "IN" "ID" "IR" "IQ" "IE"
## [127] "IL" "IT" "JM" "JP" "JO" "KZ" "KE" "KI" "KP" "KR" "KW" "KG" "LA" "LV"
## [141] "LB" "LS" "LR" "LY" "LT" "LU" "MO" "MK" "MG" "MW" "MY" "MV" "ML" "MT"
## [155] "MH" "MR" "MU" "MX" "MD" "MC" "MN" "ME" "MA" "MZ" "MM" "NA" "NR" "NP"
## [169] "NL" "NZ" "NI" "NE" "NG" "NO" "OM" "PK" "PA" "PG" "PY" "PE" "PH" "PL"
## [183] "PT" "QA" "RO" "RU" "RW" "WS" "ST" "SA" "SN" "RS" "SC" "SL" "SG" "SK"
## [197] "SI" "SB" "SO" "ZA" "ES" "LK" "SD" "SR" "SZ" "SE" "CH" "SY" "TJ" "TZ"
## [211] "TH" "TG" "TO" "TT" "TN" "TR" "TM" "UG" "UA" "AE" "GB" "US" "UY" "UZ"
## [225] "VU" "VE" "VN" "YE" "ZM" "ZW"

We need to extract only country data; load the {ISOcodes} package to access the ISO 3166 country codes:

library(ISOcodes)
data("ISO_3166_1")
head(ISO_3166_1$Alpha_2, 30)
##  [1] "AW" "AF" "AO" "AI" "AX" "AL" "AD" "AE" "AR" "AM" "AS" "AQ" "TF" "AG"
## [15] "AU" "AT" "AZ" "BI" "BE" "BJ" "BQ" "BF" "BD" "BG" "BH" "BS" "BA" "BL"
## [29] "BY" "BZ"

Extract only country data from airDeps:

totFlights <- airDeps %>%
  filter(iso2c %in% ISO_3166_1$Alpha_2, date == '2015')
print(paste0('Countries: ', length(unique(totFlights$iso2c))))

## [1] "Countries: 159"

Let’s have a look at the data set now:

kable(head(totFlights, 20))
|      value|date |indicatorID |indicator                                              |iso2c |country             |
|----------:|:----|:-----------|:------------------------------------------------------|:-----|:-------------------|
|  23532.528|2015 |IS.AIR.DPRT |Air transport, registered carrier departures worldwide |AF    |Afghanistan         |
|      0.000|2015 |IS.AIR.DPRT |Air transport, registered carrier departures worldwide |AL    |Albania             |
|  68096.268|2015 |IS.AIR.DPRT |Air transport, registered carrier departures worldwide |DZ    |Algeria             |
|  11759.184|2015 |IS.AIR.DPRT |Air transport, registered carrier departures worldwide |AS    |American Samoa      |
|  13116.000|2015 |IS.AIR.DPRT |Air transport, registered carrier departures worldwide |AO    |Angola              |
|  40046.652|2015 |IS.AIR.DPRT |Air transport, registered carrier departures worldwide |AG    |Antigua and Barbuda |
| 145583.336|2015 |IS.AIR.DPRT |Air transport, registered carrier departures worldwide |AR    |Argentina           |
|      0.000|2015 |IS.AIR.DPRT |Air transport, registered carrier departures worldwide |AM    |Armenia             |
| 658699.000|2015 |IS.AIR.DPRT |Air transport, registered carrier departures worldwide |AU    |Australia           |
| 152056.520|2015 |IS.AIR.DPRT |Air transport, registered carrier departures worldwide |AT    |Austria             |
|  18199.000|2015 |IS.AIR.DPRT |Air transport, registered carrier departures worldwide |AZ    |Azerbaijan          |
|  31211.184|2015 |IS.AIR.DPRT |Air transport, registered carrier departures worldwide |BS    |Bahamas, The        |
|  57444.000|2015 |IS.AIR.DPRT |Air transport, registered carrier departures worldwide |BH    |Bahrain             |
|  37219.116|2015 |IS.AIR.DPRT |Air transport, registered carrier departures worldwide |BD    |Bangladesh          |
|  22939.000|2015 |IS.AIR.DPRT |Air transport, registered carrier departures worldwide |BY    |Belarus             |
| 138446.155|2015 |IS.AIR.DPRT |Air transport, registered carrier departures worldwide |BE    |Belgium             |
| 120072.361|2015 |IS.AIR.DPRT |Air transport, registered carrier departures worldwide |BZ    |Belize              |
|   1208.520|2015 |IS.AIR.DPRT |Air transport, registered carrier departures worldwide |BJ    |Benin               |
|   4640.388|2015 |IS.AIR.DPRT |Air transport, registered carrier departures worldwide |BT    |Bhutan              |
|  29717.430|2015 |IS.AIR.DPRT |Air transport, registered carrier departures worldwide |BO    |Bolivia             |

Good. How many flights were there in 2015?

sum(totFlights$value)
## [1] 32960402

This number corresponds to the World Data Bank report on the IS.AIR.DPRT indicator: check it out for yourself.

What was the probability of boarding a flight that would end in a fatal accident in 2015? Let’s see: there were 16 flights with a fatal outcome in 2015 (Aviation Safety Network data), and 32,960,402 flights in total (World Data Bank):

pAirFatal <- 16/32960402
pAirFatal
## [1] 4.85431e-07
print(paste0("Percent Fatal: ", pAirFatal*100))
## [1] "Percent Fatal: 4.85430972595541e-05"
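The same probability reads more intuitively as odds of "one in N" departures, using the two counts quoted above:

```r
# - express the fatal-accident probability as "1 in N" departures
pAirFatal <- 16 / 32960402
oneInN <- round(1 / pAirFatal)
oneInN   # 2060025: roughly one fatal airliner accident per ~2.06 million departures
```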

Goran S. Milovanovic, Data Science Serbia, 01/24/2017, Belgrade, Serbia